Skip to content

Conversation

@leixm
Copy link
Contributor

@leixm leixm commented Oct 10, 2024

What changes were proposed in this pull request?

Support interrupt shuffle on client side.

I will develop the following functions in order

  1. Client supports interrupt shuffle
  2. Master supports calculating app-level shuffle usage

Why are the changes needed?

The current storage quota logic can only limit new shuffles, and cannot limit the writing of existing shuffles. In our production environment, there is such an scenario: the cluster is small, but the user's app single shuffle is large which occupied disk resources, we want to interrupt those shuffle.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unable to test this part independently, Additional tests will be added after completing the second part.

@leixm
Copy link
Contributor Author

leixm commented Oct 10, 2024

@AngersZhuuuu @RexXiong Can you help review?

@leixm leixm changed the title [WIP][CELEBORN-1577][Phase1] Storage quota should support interrupt shuffle. [CELEBORN-1577][Phase1] Storage quota should support interrupt shuffle. Oct 11, 2024
if (response.statusCode == StatusCode.SUCCESS) {
logDebug("Successfully send app heartbeat.")
workerStatusTracker.handleHeartbeatResponse(response)
checkQuotaExceeds(response.checkQuotaResponse)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a flag to tell spark whether cancel the stage when quota exceeds?

@leixm leixm force-pushed the CELEBORN-1577-1 branch 2 times, most recently from 926bfe8 to f07108e Compare October 17, 2024 09:59
new util.ArrayList[WorkerInfo](
(statusSystem.shutdownWorkers.asScala ++ statusSystem.decommissionWorkers.asScala).asJava)))
(statusSystem.shutdownWorkers.asScala ++ statusSystem.decommissionWorkers.asScala).asJava),
CheckQuotaResponse(isAvailable = true, "")))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should actually check quota for the application.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently QuotaManager only support check quota for user, at next PR, we will support check quota for application, so we don’t need to modify rpc proto at this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@leixm
Copy link
Contributor Author

leixm commented Oct 25, 2024

cc @RexXiong PTAL.

Copy link
Contributor

@RexXiong RexXiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Only minor comments

def clientShuffleMapPartitionSplitEnabled = get(CLIENT_SHUFFLE_MAPPARTITION_SPLIT_ENABLED)
def clientChunkPrefetchEnabled = get(CLIENT_CHUNK_PREFETCH_ENABLED)
def clientInputStreamCreationWindow = get(CLIENT_INPUTSTREAM_CREATION_WINDOW)
def clientQuotaInterruptShuffleEnable = get(CLIENT_QUOTA_INTERRUPT_SHUFFLE_ENABLED)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clientQuotaInterruptShuffleEnable -》 clientQuotaInterruptShuffleEnabled

import org.apache.celeborn.common.rpc.{ClientSaslContextBuilder, RpcSecurityContext, RpcSecurityContextBuilder}
import org.apache.celeborn.common.rpc.netty.{LocalNettyRpcCallContext, RemoteNettyRpcCallContext}
import org.apache.celeborn.common.util.{JavaUtils, PbSerDeUtils, ThreadUtils, Utils}
// Can Remove this if celeborn don't support scala211 in future
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't remove this

.createWithDefault(1)

val CLIENT_QUOTA_INTERRUPT_SHUFFLE_ENABLED: ConfigEntry[Boolean] =
buildConf("celeborn.client.quota.interruptShuffle.enabled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should rename the configuration to celeborn.quota.interruptShuffle.enabled, as 'quota' represents an independent configuration namespace. It would also be appropriate to move this setting to the quota section.

@leixm
Copy link
Contributor Author

leixm commented Oct 28, 2024

cc @RexXiong Done, thanks for your review.


val QUOTA_INTERRUPT_SHUFFLE_ENABLED: ConfigEntry[Boolean] =
buildConf("celeborn.quota.interruptShuffle.enabled")
.categories("quota", "master", "client")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems this configuration never used at master

@codecov
Copy link

codecov bot commented Oct 30, 2024

Codecov Report

Attention: Patch coverage is 35.29412% with 11 lines in your changes missing coverage. Please review.

Project coverage is 32.27%. Comparing base (464a7c7) to head (c6618a7).
Report is 11 commits behind head on main.

Files with missing lines Patch % Lines
...born/common/protocol/message/ControlMessages.scala 0.00% 10 Missing ⚠️
...cala/org/apache/celeborn/common/CelebornConf.scala 85.72% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2801      +/-   ##
==========================================
- Coverage   32.37%   32.27%   -0.09%     
==========================================
  Files         328      329       +1     
  Lines       19387    19483      +96     
  Branches     1747     1749       +2     
==========================================
+ Hits         6275     6287      +12     
- Misses      12769    12853      +84     
  Partials      343      343              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@RexXiong RexXiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! merge to main(v0.6.0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants